Poster: Tagishsimon Date: Sep 19, 2011 4:23am
Forum: texts Subject: Inconsistent author names

Apologies if this ground has been covered before...

...Is there any way to improve the text catalogue in respect of author name consistency? I regularly come across authors with name variants, such as:

* "George Tate" - http://www.archive.org/details/historyboroughc00tategoog - and
* "Tate, George, 1805-1871" - http://www.archive.org/details/historyboroughc01tategoog -

which, what with the OCD and all, drives me mad. Can anything be done?

Poster: stbalbach Date: Sep 20, 2011 3:18pm
Forum: texts Subject: Re: Inconsistent author names

The problem you mention is more than just cosmetic. When building targeted searches, one has to be aware of the many ways an author's name might appear in the database. This requires complex search strings, so complex in fact that the string can exceed 1024 characters (or whatever the max is), making it impossible to search for all books by an individual author (see below for an example). I've asked info@archive.org a number of times if they could simply increase the max search string length but never received a reply. It seems to me a rather trivial thing to do: just increase the variable size and recompile (assuming there's not already a cfg file for it).

--

With that said, it helps to understand some things about IA in order to be more forgiving of its limitations.

1. It is not a company, but a non-profit. It has limited staff and resources. It runs triage with those resources, i.e. there is more work than resources available. I don't always agree with its priorities, but in the end they add a lot of new books every day! That's the most important thing.

2. The metadata is entered by various entities. Maybe one book was entered by Microsoft, another by the University of Michigan, another by John Smith, a user who uploaded it in his spare time, another by the Federal Govt. Maybe the data was imported from an old database, maybe it was created fresh just for IA, maybe it was done 10 years ago. There is not a single person or entity responsible for making sure the data is consistent.

3. There is no direct way for end-users to modify the metadata at archive.org (other than notes in the review field). But there is openlibrary.org which is sort of a Wikipedia-like interface to Internet Archive which anyone can edit.

---

Example search string to catch many occurrences of Robert Louis Stevenson in the database. Note: this search string is almost at the maximum length allowed. It would be easy to construct a longer one that would kick back an error, but this is the type of search the complex nature of this database requires.

mediatype:(texts) (subject:"Stevenson, Robert Louis, 1850-1894" OR subject:"Stevenson, R. L. (Robert Louis), 1850-1894" OR subject:"Stevenson, Robert L. (Robert Louis), 1850-1894" OR subject:"Stevenson, Robert Louis" OR subject:"Stevenson, R. L. (Robert Louis)" OR subject:"Stevenson, Robert L. (Robert Louis)" OR subject:"Robert Louis Stevenson" OR subject:"Robert L. Stevenson" OR subject:"R. L. Stevenson" OR creator:"Stevenson, Robert Louis, 1850-1894" OR creator:"Stevenson, Robert Louis, Sir, 1850-1894" OR creator:"Stevenson, R. L. (Robert Louis), 1850-1894" OR creator:"Stevenson, Robert L. (Robert Louis), 1850-1894" OR creator:"Stevenson, Robert Louis" OR creator:"Stevenson, R. L. (Robert Louis)" OR creator:"Stevenson, Robert L. (Robert Louis)" OR creator:"Robert Louis Stevenson" OR creator:"Robert L. Stevenson" OR creator:"R. L. Stevenson" OR title:"Robert Louis Stevenson" OR title:"Robert L. Stevenson" OR title:"R. L. Stevenson" OR description:"Robert Louis Stevenson" OR description:"Robert L. Stevenson" OR description:"R. L. Stevenson" OR description:"Stevenson, Robert Louis" OR description:"Stevenson, R. L. (Robert Louis)" OR description:"Stevenson, Robert L. (Robert Louis)")
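
A query of this shape is every field paired with every name variant, so it can be generated rather than typed by hand. The following is a minimal sketch only: the field names and the first few variants are copied from the query above, while `build_query`, `exceeds_limit`, and the 1024-character limit are illustrative assumptions (the real maximum is not confirmed anywhere in this thread).

```python
from itertools import product

# Field names taken from the query above.
FIELDS = ["subject", "creator", "title", "description"]

# A few of the name variants from the query above (not the full list).
VARIANTS = [
    "Stevenson, Robert Louis, 1850-1894",
    "Stevenson, Robert Louis",
    "Robert Louis Stevenson",
    "Robert L. Stevenson",
    "R. L. Stevenson",
]

# ASSUMPTION: the thread only says "1024 characters (or whatever the max is)".
ASSUMED_MAX_LENGTH = 1024


def build_query(fields, variants, mediatype="texts"):
    """Join every field/variant pair into one OR-ed search string."""
    clauses = [f'{field}:"{name}"' for field, name in product(fields, variants)]
    return f"mediatype:({mediatype}) ({' OR '.join(clauses)})"


def exceeds_limit(query, limit=ASSUMED_MAX_LENGTH):
    """Report whether the query is longer than the assumed maximum."""
    return len(query) > limit


query = build_query(FIELDS, VARIANTS)
print(len(query), "characters; over assumed limit:", exceeds_limit(query))
```

Because the number of clauses is fields times variants, each extra spelling of a name adds one clause per field, which is why the string grows so quickly.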

Poster: martyveldman Date: Sep 20, 2011 4:24pm
Forum: texts Subject: Re: Inconsistent author names

I use:
Stevenson "Robert Louis"
That gets 1,547 results, including Google Books.

Poster: garthus Date: Sep 20, 2011 5:08pm
Forum: texts Subject: Re: Inconsistent author names

And the string gets 1581; not much of a difference.

Gerry

Poster: stbalbach Date: Sep 20, 2011 7:01pm
Forum: texts Subject: Re: Inconsistent author names

True, 2.2% missing doesn't seem bad, but I bet if you scrolled through the 1547 set, you'd find books that don't belong, while books that should be there are not included, making that missing percentage larger. For example, books under "R.L. Stevenson" would not show up in the 1547 set, and there are other cases like that. It's the problem of inconsistent author names, which require customized searches that can be longer than search strings allow.
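
For reference, the 2.2% figure follows from the two result counts quoted in this thread (1,547 for the short search, 1581 for the long string). A small sketch of the arithmetic, with variable names chosen here purely for illustration:

```python
short_search_hits = 1547  # Stevenson "Robert Louis"
long_string_hits = 1581   # the long OR-ed search string

# Items the long string finds that the short search does not account for.
missing = long_string_hits - short_search_hits
pct_missing = 100 * missing / long_string_hits

print(f"{missing} items, {pct_missing:.1f}% of the larger set")
```

This only compares the sizes of the two result sets; it says nothing about false positives in either one, which is the point made above.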

Poster: garthus Date: Sep 21, 2011 3:10pm
Forum: texts Subject: Re: Inconsistent author names

I guess eventually this can and will be addressed, probably through Open Library, unless editing of subject fields is allowed.

Gerry